OxfordVGG Submission to the EGO4D AV Transcription Challenge
This report presents the technical details of our submission on the EGO4D
Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the
OxfordVGG team. We present WhisperX, a system for efficient speech
transcription of long-form audio with word-level time alignment, along with two
text normalisers which are publicly available. Our final submission obtained
a Word Error Rate (WER) of 56.0% on the challenge test set, ranking 1st on the
leaderboard. All baseline code and models are available at
https://github.com/m-bain/whisperX.
Comment: Technical Report
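For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below illustrates that standard definition only; the function name is hypothetical, and the whitespace tokenisation with no text normalisation is an assumption. It is not the challenge's scoring script, which would apply the normalisers the abstract mentions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    no text normalisation is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.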
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Large-scale, weakly-supervised speech recognition models, such as Whisper,
have demonstrated impressive results on speech recognition across domains and
languages. However, their application to long audio transcription via buffered
or sliding-window approaches is prone to drifting, hallucination and repetition,
and their sequential nature prohibits batched transcription. Further,
timestamps corresponding to each utterance are prone to inaccuracies and
word-level timestamps are not available out-of-the-box. To overcome these
challenges, we present WhisperX, a time-accurate speech recognition system with
word-level timestamps utilising voice activity detection and forced phoneme
alignment. In doing so, we demonstrate state-of-the-art performance on
long-form transcription and word segmentation benchmarks. Additionally, we show
that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves
transcription quality and enables a twelve-fold transcription speedup via
batched inference.
Comment: Accepted to INTERSPEECH 202
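The pre-segmentation idea can be sketched roughly as follows: speech regions from a voice activity detector are first cut so that none exceeds a maximum duration, then neighbouring regions are merged greedily into chunks up to that duration, so the chunks can be transcribed independently and in a batch. This is a simplified illustration, not the WhisperX implementation. The function names, the 30-second default and the choice to cut at fixed boundaries are assumptions made for the example; the abstract does not specify where the cuts are placed.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) of a speech region, in seconds


def cut_long_segments(segments: List[Segment], max_len: float) -> List[Segment]:
    """Split any segment longer than max_len into pieces of at most max_len.

    Assumption: cuts fall at fixed max_len boundaries, for simplicity.
    """
    pieces: List[Segment] = []
    for start, end in segments:
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        pieces.append((start, end))
    return pieces


def merge_segments(segments: List[Segment], max_len: float) -> List[Segment]:
    """Greedily merge neighbours while the merged span stays within max_len."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)  # extend the current chunk
        else:
            merged.append((start, end))        # start a new chunk
    return merged


def cut_and_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Cut over-long VAD segments, then merge short ones into chunks."""
    return merge_segments(cut_long_segments(sorted(segments), max_len), max_len)


if __name__ == "__main__":
    vad_output = [(0.0, 4.0), (5.0, 12.0), (13.0, 28.0), (29.0, 40.0), (50.0, 95.0)]
    for chunk in cut_and_merge(vad_output):
        print(chunk)
```

Every resulting chunk is bounded in length, so chunks can be padded to a common size and passed through the recogniser together rather than one after another.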
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
In egocentric videos, actions occur in quick succession. We capitalise on the
action's temporal context and propose a method that learns to attend to
surrounding actions in order to improve recognition performance. To incorporate
the temporal context, we propose a transformer-based multimodal model that
ingests video and audio as input modalities, with an explicit language model
providing action sequence context to enhance the predictions. We test our
approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art
performance. Our ablations showcase the advantage of utilising temporal context
as well as incorporating the audio input modality and a language model to rescore
predictions. Code and models at: https://github.com/ekazakos/MTCN.
Comment: Accepted at BMVC 202
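To make the rescoring idea concrete, the toy sketch below combines per-step recognition scores with a sequence prior over consecutive actions and picks the best-scoring sequence. Everything here is an illustrative assumption: the bigram table standing in for the language model, the exhaustive search over candidates, the `lm_weight` parameter and the example action labels are not taken from the paper or its code release.

```python
import math
from itertools import product
from typing import Dict, List, Tuple


def rescore(
    candidates: List[Dict[str, float]],
    bigram_logp: Dict[Tuple[str, str], float],
    lm_weight: float = 1.0,
    unseen_logp: float = math.log(1e-3),
) -> Tuple[List[str], float]:
    """Pick the action sequence maximising model score + weighted LM score.

    candidates[i] maps each candidate action at step i to its log-probability
    from the recognition model. bigram_logp is a hypothetical stand-in for a
    language model over action sequences; unseen pairs get unseen_logp.
    """
    best_seq: Tuple[str, ...] = ()
    best_score = -math.inf
    # Exhaustive search is only feasible for a few steps and candidates.
    for seq in product(*(list(step) for step in candidates)):
        model_score = sum(candidates[i][action] for i, action in enumerate(seq))
        lm_score = sum(
            bigram_logp.get(pair, unseen_logp) for pair in zip(seq, seq[1:])
        )
        score = model_score + lm_weight * lm_score
        if score > best_score:
            best_seq, best_score = seq, score
    return list(best_seq), best_score


if __name__ == "__main__":
    steps = [
        {"open fridge": math.log(0.6), "open cupboard": math.log(0.4)},
        {"take milk": math.log(0.45), "take plate": math.log(0.55)},
    ]
    prior = {
        ("open fridge", "take milk"): math.log(0.7),
        ("open fridge", "take plate"): math.log(0.05),
        ("open cupboard", "take plate"): math.log(0.7),
    }
    print(rescore(steps, prior, lm_weight=0.0)[0])  # recognition scores only
    print(rescore(steps, prior, lm_weight=1.0)[0])  # with the sequence prior
```

In this toy example the sequence prior changes the second prediction from "take plate" to "take milk", because taking milk is far more plausible after opening the fridge.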
Spot the conversation: speaker diarisation in the wild
The goal of this paper is speaker diarisation of videos collected 'in the
wild'. We make three key contributions. First, we propose an automatic
audio-visual diarisation method for YouTube videos. Our method consists of
active speaker detection using audio-visual methods and speaker verification
using self-enrolled speaker models. Second, we integrate our method into a
semi-automatic dataset creation pipeline which significantly reduces the number
of hours required to annotate videos with diarisation labels. Finally, we use
this pipeline to create a large-scale diarisation dataset called VoxConverse,
collected from 'in the wild' videos, which we will release publicly to the
research community. Our dataset consists of overlapping speech, a large and
diverse speaker pool, and challenging background conditions.
Comment: The dataset will be available for download from
http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The
development set will be released in July 2020, and the test set will be
released in October 202
25th annual computational neuroscience meeting: CNS-2016
The same neuron may play different functional roles in the neural circuits to which it belongs. For example, neurons in the Tritonia pedal ganglia may participate in variable phases of the swim motor rhythms [1]. While such neuronal functional variability is likely to play a major role in the delivery of the functionality of neural systems, it is difficult to study in most nervous systems. We work on the pyloric rhythm network of the crustacean stomatogastric ganglion (STG) [2]. Typically, network models of the STG treat neurons of the same functional type as a single model neuron (e.g. PD neurons), assuming the same conductance parameters for these neurons and implying their synchronous firing [3, 4]. However, simultaneous recording of PD neurons shows differences between the timings of spikes of these neurons. This may indicate functional variability of these neurons. Here we modelled separately the two PD neurons of the STG in a multi-neuron model of the pyloric network. Our neuron models comply with known correlations between conductance parameters of ionic currents. Our results reproduce the experimental finding of increasing spike time distance between spikes originating from the two model PD neurons during their synchronised burst phase. The PD neuron with the larger calcium conductance generates its spikes before the other PD neuron. Larger potassium conductance values in the follower neuron imply longer delays between spikes, see Fig. 17. Neuromodulators change the conductance parameters of neurons and maintain the ratios of these parameters [5]. Our results show that such changes may shift the individual contribution of two PD neurons to the PD-phase of the pyloric rhythm, altering their functionality within this rhythm. Our work paves the way towards an accessible experimental and computational framework for the analysis of the mechanisms and impact of functional variability of neurons within the neural circuits to which they belong.